Simple Math is Enough

...Mathematical depth and 
elegance are highly desirable, 
but often simple mathematics, 
artfully applied, is the key to 
success. 

— Richard M. Karp 






meaning making of genomic data 


• Genomic data 

- Two-hybrid protein-protein interactions 

- DNA microarray mRNA transcription 
Clustering: a group of genes is selected based
on the similarity in their expression under 
different stimulations 
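One simple way to make "similarity in expression" concrete is the Pearson correlation between expression profiles. The sketch below is only an illustration under that assumption: the names `pearson` and `coexpressed_with`, the threshold `min_r`, and the toy profiles are all hypothetical and do not come from the talk.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (sd_x * sd_y)

def coexpressed_with(seed, profiles, min_r=0.9):
    """Genes whose profile correlates with the seed gene at or above min_r."""
    ref = profiles[seed]
    return sorted(
        gene for gene, prof in profiles.items()
        if gene != seed and pearson(ref, prof) >= min_r
    )

# Toy expression values under four stimulations (made up for illustration).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
cluster = coexpressed_with("geneA", profiles)
```

Real clustering methods group all genes at once rather than around one seed; the seed-based version is used here only to keep the example short.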

Common motifs: search for shared nucleotide 
motifs in DNA sequences a few hundred bp 
before the transcription start of each gene 


Example: CRCGAAA 


R=A or G 


Tavazoie et al. Nat. Genet. 22, 213.
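A motif written with IUPAC characters, such as CRCGAAA with R = A or G, can be turned into a regular expression and searched for in upstream sequences. The following is a minimal sketch; the function names and the toy upstream sequences are hypothetical, and only the forward strand is scanned.

```python
import re

# IUPAC nucleotide codes used here, written as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]",
    "W": "[AT]", "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def motif_regex(motif):
    """Compile an IUPAC motif such as 'CRCGAAA' into a regular expression."""
    return re.compile("".join(IUPAC[ch] for ch in motif.upper()))

def genes_with_motif(motif, upstream):
    """Names of genes whose upstream sequence contains the motif."""
    pattern = motif_regex(motif)
    return sorted(g for g, seq in upstream.items() if pattern.search(seq.upper()))

# Toy upstream regions (made up for illustration).
upstream = {
    "gene1": "TTTCACGAAATT",  # contains CACGAAA (R = A)
    "gene2": "GGCGCGAAACCA",  # contains CGCGAAA (R = G)
    "gene3": "TTTCTCGAAATT",  # CTCGAAA: T is not allowed at the R position
}
hits = genes_with_motif("CRCGAAA", upstream)
```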


meaning making of genomic data 


• Genomic data 

- Two-hybrid protein-protein interactions
- DNA microarray mRNA transcription

• High rate of error in current technologies 

• Think of some aspects of the data that are both non-random and
biologically meaningful

• Compute a p-value associated with such a non-random
feature and use it to weed out false positives
Protein-protein interactions:
non-random features

Jeong et al., Nature (2001) 411:41-2.


In this talk...


A method of suggesting protein functions 
based on protein-protein interaction data. 

- Samanta, M., Liang, S., Proc Natl Acad Sci USA
(2003) 100, 12579-12583.

A method of extracting protein-binding 
DNA motifs from a single microarray 
experiment. 

- Bussemaker et al. Nat Genet. (2001) 27, 167-171.

- Work in progress 






Yeast two-hybrid assay 
Guessing function is difficult


ADR1

Proteins it interacts with:

ADA2     trans. adaptor or co-activator
GCN5     histone acetyltransferase
SPT15    TATA binding protein TBP
SUA7     TFIIB subunit
TAF145   TFIID subunit
TAF25    TFIID and SAGA subunit
ARP2     actin-like protein
BMH1     signaling protein
TAF60    TFIID and SAGA subunit
HRT1     similarity to Lotus RING finger protein
KAP104   beta-karyopherin
PPT1     protein ser/thr phosphatase
SHO1     HOG1 high-osmo. signal transduction pathway
YKU80    Component of DNA end-joining repair pathway
RPC40    DNA-directed RNA pol. I, III subunit
COP1     alpha chain of secretory pathway vesicles
TAF90    TFIID and SAGA subunit


Prediction of protein function is difficult 
from the raw data 

Example 2: 

YDL246C: function unknown (SGD database) 


Proteins it interacts with:

PHO85     Phosphate & glucose metabolism
PSE1      Nuclear transport of protein
SOR1      Sorbitol dehydrogenase
SRP1      Protein transport
YJR037W   Unknown
TEM1      Signaling protein







We derive a p-value based on
two proteins having a large number of
interaction partners in common

Protein 1 interacts with n_1 partners; Protein 2
interacts with n_2 partners.

The probability P of having m partners in
common
Counting problem #1:


Count the distinct ways for protein 1 to
have n_1 interacting partners.

Similarly for protein 2.

Counting problem #2:

Protein 1 and protein 2 have m
interacting partners in common.

n_1 - m remaining partners for protein 1


n_2 - m remaining partners for protein 2
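The two counting problems can be combined into a single probability. The sketch below rests on an assumption that the surviving slide text does not state: the network contains N proteins in total, and each protein's partners are drawn at random from those N. Problem #1 gives the total number of arrangements; problem #2 counts those with exactly m shared partners by choosing the m common partners first, then the remaining n_1 - m partners of protein 1, then the remaining n_2 - m partners of protein 2 from proteins that are not partners of protein 1.

```latex
\begin{align*}
\text{total arrangements} &= \binom{N}{n_1}\binom{N}{n_2} \\
\text{arrangements with } m \text{ shared partners}
  &= \binom{N}{m}\binom{N-m}{n_1-m}\binom{N-n_1}{n_2-m} \\
P(N, n_1, n_2, m)
  &= \frac{\binom{N}{m}\binom{N-m}{n_1-m}\binom{N-n_1}{n_2-m}}
          {\binom{N}{n_1}\binom{N}{n_2}}
   = \frac{\binom{n_1}{m}\binom{N-n_1}{n_2-m}}{\binom{N}{n_2}}
\end{align*}
```

The last form follows from $\binom{N}{m}\binom{N-m}{n_1-m} = \binom{N}{n_1}\binom{n_1}{m}$ and is the hypergeometric distribution.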



We derive a p-value based on
two proteins having a large number of
interaction partners in common


Protein 1 interacts with n_1 partners; Protein 2
interacts with n_2 partners.

The probability P of having m partners in
common
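A numerical sketch of this probability, assuming a network of N proteins in which each protein's partners are chosen at random (N and the function names are assumptions, not taken from the slides). The first function counts arrangements with exactly m shared partners and divides by the total number of arrangements; the second sums the tail, which is one common way to turn the count into a p-value.

```python
from math import comb

def prob_exactly_m_shared(N, n1, n2, m):
    """Probability that proteins with n1 and n2 random partners share exactly m."""
    if m < 0 or m > min(n1, n2) or n1 + n2 - m > N:
        return 0.0  # impossible configuration
    # Choose the m common partners, then the rest for protein 1, then for protein 2.
    favourable = comb(N, m) * comb(N - m, n1 - m) * comb(N - n1, n2 - m)
    total = comb(N, n1) * comb(N, n2)
    return favourable / total

def prob_at_least_m_shared(N, n1, n2, m):
    """Tail probability: m or more shared partners."""
    return sum(
        prob_exactly_m_shared(N, n1, n2, k)
        for k in range(m, min(n1, n2) + 1)
    )

print(prob_exactly_m_shared(10, 3, 4, 2))
print(prob_at_least_m_shared(10, 3, 4, 2))
```

Python integers are exact, so the binomial coefficients do not overflow even for thousands of proteins; only the final division is done in floating point.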
Associations of ADR1 from our method

Prot.    Log(p)    Function of protein
TAF61    -10.74    TFIID and SAGA subunit
NGG1     -9.85     general transcriptional adaptor or co-activator
TAF60    -9.33     TFIID and SAGA subunit
ADA2     -9.33     general transcriptional adaptor or co-activator
GCN4     -9.19     transcriptional activator of amino acid biosynthetic genes
TAF17    -8.86     TFIID and SAGA subunit
SPT7     -8.3      involved in alteration of transcription start site selection
TSM1     -8.09     component of TFIID complex
SPT20    -7.83     member of the TBP class of SPT proteins that alter transcription site selection
SPT15    -7.44     the TATA-binding protein TBP
TAF90    -7.36     TFIID and SAGA subunit
TAF19    -7.08     TFIID subunit (TBP-associated factor), 19 kD
GAL4     -6.94     transcription factor


Example 2: YDL246C 


Raw interaction data:

YDL246C: function unknown (SGD database)

PHO85     Phosphate & glucose metabolism
PSE1      Nuclear transport of protein
SOR1      Sorbitol dehydrogenase
SRP1      Protein transport
YJR037W   Unknown
TEM1      Signaling protein


Proteins Sharing Partners with YDL246C (using our algorithm):

SOR1    Sorbitol dehydrogenase   -13 [log(p)]
HSP10   Heat-shock protein       -6 (too small)
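A ranking like the one above can be sketched from a raw interaction list: collect each protein's partner set, count shared partners for every pair, and score the pair by log10 of the shared-partner probability. This is an illustration under stated assumptions, not the authors' code: N is taken to be the number of proteins in the network, the score uses the probability of exactly m shared partners, and the names `rank_associations`, `min_shared`, and the toy edges are hypothetical.

```python
from itertools import combinations
from math import comb, log10

def build_partners(edges):
    """Map each protein to the set of its interaction partners."""
    partners = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return partners

def log10_p(N, n1, n2, m):
    """log10 of the probability of exactly m shared partners."""
    num = comb(N, m) * comb(N - m, n1 - m) * comb(N - n1, n2 - m)
    den = comb(N, n1) * comb(N, n2)
    return log10(num) - log10(den)  # log10 accepts arbitrarily large ints

def rank_associations(edges, min_shared=2):
    """Protein pairs sorted from most to least significant."""
    partners = build_partners(edges)
    N = len(partners)
    scored = []
    for a, b in combinations(sorted(partners), 2):
        m = len(partners[a] & partners[b])
        if m >= min_shared:
            scored.append((log10_p(N, len(partners[a]), len(partners[b]), m), a, b))
    return sorted(scored)

# Toy network (made up): X and Y share all three of their partners.
edges = [
    ("X", "P1"), ("X", "P2"), ("X", "P3"),
    ("Y", "P1"), ("Y", "P2"), ("Y", "P3"),
    ("Z", "P1"), ("Z", "Q1"), ("Q1", "Q2"), ("Q2", "Q3"),
]
ranking = rank_associations(edges)
```

The loop visits every pair of proteins, which is quadratic in network size; for a few thousand proteins that is still manageable.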


http://www.nas.nasa.gov/bio/ 





By clustering we can recover complexes and pathways 

202 modules are reconstructed, covering most aspects of the cell.
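One simple way to turn pairwise associations into modules is to link every pair whose log(p) falls below a cutoff and take the connected groups. The sketch below uses that single-linkage cut as a stand-in; it is not necessarily the clustering procedure used in the talk, and the cutoff value, the function name, and the toy scores are assumptions made for illustration.

```python
def modules_from_associations(scored, max_log10_p=-6.0):
    """Group proteins linked by associations with log10(p) <= max_log10_p.

    `scored` is an iterable of (log10_p, protein_a, protein_b) tuples.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for logp, a, b in scored:
        if logp <= max_log10_p:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)

# Toy associations with made-up scores.
scored = [
    (-9.3, "a1", "a2"),
    (-8.9, "a2", "a3"),
    (-7.0, "b1", "b2"),
    (-2.0, "a3", "b1"),  # too weak to link the two groups
]
modules = modules_from_associations(scored)
```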

YGL198W ?
YGL161C ?
YIF1 (ER-Golgi transport)
GDI1 (ER-Golgi transport)


We predicted functions of 81 unannotated proteins.
22 out of 23 are now known to be correct.

YDL246C: same function as SOR1 (sorbitol
dehydrogenase)

GAL11 (trans. mediator)
ROX3 (trans. mediator)
SRB6 (trans. mediator)
MED2 (trans. mediator)
MED7 (trans. mediator)

Predicted functions of 81 unannotated proteins
(22 out of 23 are now known to be correct)


Protein

YFR024C-A (YSC85), YHR114W (BZZ1)*, YNL094W (APP1), YMR192W (APP2)
YGR268C (HUA1), YOR284W (HUA2), YPR171W (BSP1)
YJR083C (ACF4)
YDR036C (EHD3)
YKL214C (YRA2)*
YNL207W (RIO2)
YLR409C (UTP21), YKR060W (UTP30), YGR090W (UTP22), YER082C (UTP7)*, YJL069C (UTP18)*
YBR247C (ENP1)
YMR288W (HSH155)
YHR197W (RIX1), YNL182C (IPI3), YLR106C (MDN1)*
YGR128C (UTP8)
YGR215W (RSM27)*, YGL129C (RSM23)*
YDL213C (NOP6)
YNL306W (MRPS18)
YPR144C (UTP19), YDL148C (NOP14)*, YLR186W (EMG1), YJL109C (UTP10)*, YBL004W (UTP20)
YGL099W (LSG1)*, YDR101C (ARX1)
YOL077C (BRX1), YOR206W (NOC2), YNL135C (FPR1)
YOR145C (DIM2)


Predicted function

Actin filament organization
Actin patch assembly
Actin cytoskeleton organization and biogenesis
Protein biosynthesis in mitochondrial small ribosomal subunit
mRNA processing/RNA metabolism
Nucleolar protein involved in 40S ribosomal biogenesis
Associated with U3 snoRNA and 20S rRNA biosynthesis
SnRNA binding involved in mRNA splicing
Ribosomal large subunit assembly and maintenance
Processing of 20S pre-rRNA
Structural constituent of ribosome
rRNA processing/transcription elongation
Mitochondrial small ribosomal subunit
SnoRNA binding, 35S primary transcript processing
27S pre-rRNA ribosomal subunit
Biogenesis and transport of ribosome
35S primary transcript processing and rRNA modification


Protein

YEL015W (DCP3)
YDL002C (NHP10), YLR176C (RFX1)
YDR469W (SDC1)*
YPL070W (MUK1)
YLR427W (MAG2)
YDL076C (RXT3), YIL112W (HOS4)
YNL265C (IST1)
YLR192C (HCR1)*
YDL074C (BRE1)
YGR156W (PTI1), YKL059C (MPE1)
YGR089W (NNF2)
YGL161C (YIP5), YGL198W (YIP4)
YPL246C (RBD2), YJL151C (SNA3), YGL104C (VPS73)
YKR030W (MSG1)
YBR098W (MMS4)
YHR105W (YPT33)
YBL049W (MOH1), YCL039W (MOH2)
YDL246C (SOR2)
YMR322C (SNO4)
YDR430C (CYM1)
YJL199C (MBB1), YPL004C (LSP1), YGR086C (PIL1)
YLR097C (HRT3)
YKR046C (PET10)
YEL017W (GTT3)
YGL133W (ITC1)
YGR161C (RTS3)
YOR144C (EFD1)
YML117W (NAB6)
YLR432W (IMD3)
YKL095W (YJU2), YGR278W (CWC22), YDL209C (CWC2)*
YGR232W (NAS6)*, YGL004C (RPN14)*, YLR421C (RPN13)*


Predicted function

Deadenylation-dependent decapping and mRNA catabolism
Modification of chromatin architecture/transcription
Chromatin silencing and histone methylation
Transcription factor (or its carrier)
DNA AT-glycosylase involved in DNA dealkylation
Histone deacetylase complex involved in chromatin silencing
Transcription initiation factor
Translation initiation as part of eIF3 complex
Chromosome condensation and segregation process
mRNA cleavage and polyadenylation specificity factor
Chromosome segregation (spindle pole) and mitosis
Vesicle mediated transport
Cell wall synthesis/protein-vacuolar targeting
Golgi to endosome transport and vesicle organization
Golgi to vacuolar transport
Both same function. Possibly linked with vacuolar transport
Possibly involved in fructose and mannose metabolism
Pyridoxine metabolism
Protein involved in pyruvate metabolism
Metabolic protein
Nuclear ubiquitin ligase
ATP/ADP exchange
Protein linked with glutathione metabolism
Chromatin remodeling
Protein phosphatase 2A complex
DNA replication and repair
Nuclear RNA binding
RNA helicase involved in mRNA splicing
Spliceosome complex involved in mRNA splicing
Proteasome complex
Our method is very robust to noise!



After adding 50% random noise, we still recover 90% of
the top 2800 associations.

The method is not biased toward proteins with a large number of
interaction partners. JSN1 has the largest number of interaction
partners, yet none of the top associations involves JSN1.
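A robustness check of this kind can be sketched as follows: rank associations on the original network, add random interactions, rank again, and measure how many of the original top pairs survive. The sketch interprets "50% random noise" as adding random edges equal to half the original edge count, which is an assumption; the scoring uses the shared-partner probability with N taken as the number of proteins in the network, and all function names and the toy network are hypothetical.

```python
import random
from itertools import combinations
from math import comb, log10

def top_pairs(edges, k):
    """The k most significant protein pairs by shared-partner log10(p)."""
    partners = {}
    for a, b in edges:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    N = len(partners)
    scored = []
    for a, b in combinations(sorted(partners), 2):
        n1, n2 = len(partners[a]), len(partners[b])
        m = len(partners[a] & partners[b])
        if m >= 2:
            num = comb(N, m) * comb(N - m, n1 - m) * comb(N - n1, n2 - m)
            den = comb(N, n1) * comb(N, n2)
            scored.append((log10(num) - log10(den), a, b))
    return {(a, b) for _, a, b in sorted(scored)[:k]}

def add_random_edges(edges, fraction, rng):
    """Return the edge list plus fraction * len(edges) new random edges."""
    nodes = sorted({x for e in edges for x in e})
    existing = {frozenset(e) for e in edges}
    noisy = list(edges)
    target = len(edges) + int(fraction * len(edges))
    # Assumes the network is sparse enough that unused pairs exist.
    while len(noisy) < target:
        a, b = rng.sample(nodes, 2)
        if frozenset((a, b)) not in existing:
            existing.add(frozenset((a, b)))
            noisy.append((a, b))
    return noisy

def recovery(edges, k, fraction=0.5, seed=0):
    """Fraction of the clean top-k pairs still in the top k after adding noise."""
    rng = random.Random(seed)
    clean = top_pairs(edges, k)
    noisy = top_pairs(add_random_edges(edges, fraction, rng), k)
    return len(clean & noisy) / len(clean) if clean else 1.0

# Toy network (made up for illustration).
edges = [
    ("X", "P1"), ("X", "P2"), ("X", "P3"),
    ("Y", "P1"), ("Y", "P2"), ("Y", "P3"),
    ("Z", "P1"), ("Z", "Q1"), ("Q1", "Q2"), ("Q2", "Q3"),
]
print(recovery(edges, k=3))
```

A toy network this small says nothing about real robustness; the point is only the shape of the experiment.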



Summary


i) Non-random features in the genomic data are usually
biologically meaningful. The key is to choose the feature well.
Having a p-value based score prioritizes the findings.

ii) If two proteins share an unusually large number of common
interaction partners, they tend to be involved in the same
biological process. We used this finding to predict the functions
of 81 unannotated proteins in yeast.







ChIP-chip experiments

A transcription factor (TF) is
engineered to contain a tag

DNA fragments that bind to the TF
are pulled out (enriched) and
compared to the background
without enrichment.

Using DNA chips, preferred
binding sites are identified,
genome-wide, to within a few
hundred nucleotides.


Find the binding motif.

Ren et al. Science (2000); Iyer et al. Nature 409, 533











Improvements

• Allow motifs to be fuzzy

- A motif may contain a small number of IUPAC
characters: S(CG), W(AT), K(GT), M(AC),
R(AG), Y(CT).

• Transcription factors are known to bind to
fuzzy motifs. Therefore, with IUPAC characters the
motifs are more realistic.
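To see how quickly fuzziness multiplies the search space, the sketch below enumerates the fuzzy motifs that still match a given exact motif when exactly m positions are replaced by a two-base IUPAC character. Each base belongs to three of the six codes listed above, so an exact motif of length L has C(L, m) * 3^m such variants. The function names are hypothetical and this is an illustration only, not the enumeration scheme used in the talk.

```python
from itertools import combinations, product

# The six two-base IUPAC codes listed on the slide.
TWO_BASE_CODES = {"S": "CG", "W": "AT", "K": "GT", "M": "AC", "R": "AG", "Y": "CT"}

def codes_covering(base):
    """IUPAC codes whose base set contains the given base."""
    return sorted(code for code, bases in TWO_BASE_CODES.items() if base in bases)

def fuzzy_variants(motif, m):
    """All motifs obtained by replacing exactly m bases with a covering IUPAC code."""
    variants = []
    for positions in combinations(range(len(motif)), m):
        choices = [codes_covering(motif[i]) for i in positions]
        for picks in product(*choices):
            chars = list(motif)
            for i, code in zip(positions, picks):
                chars[i] = code
            variants.append("".join(chars))
    return variants

one_fuzzy = fuzzy_variants("CACGAAA", 1)  # includes "CRCGAAA"
```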



Fuzzy motifs require
much more computation


• For L = 10, there are 4^L ≈ 10^6 motifs. Each takes M·G
calculations, where G (= 6000) is the number of genes and
M (= 500) is the number of nucleotides.


• For m IUPAC characters, add another factor of
3500 (for m = 3) for the additional motifs.


• We exploit the sparseness of the count matrix and
store certain intermediate results to
achieve a several hundred-fold speedup.
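The arithmetic behind these bullets, and the reason sparseness helps, can be sketched as follows. The dense cost is simply 4^L * M * G. The sparse alternative stores, for each gene, only the L-mers that actually occur in its sequence; a sequence of M nucleotides contains at most M - L + 1 distinct L-mers, far fewer than 4^L. The function names and the toy sequence are hypothetical, and this is not the authors' implementation.

```python
from collections import Counter

def dense_cost(L, M, G):
    """Operations needed to test every exact L-mer against every position of every gene."""
    return 4 ** L * M * G

def sparse_motif_counts(sequences, L):
    """Per-gene counts of the L-mers that actually occur (a sparse count matrix)."""
    counts = {}
    for gene, seq in sequences.items():
        counts[gene] = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
    return counts

print(dense_cost(10, 500, 6000))  # about 3 * 10^12 operations

# Toy sequence (made up): only 4 of the 256 possible 4-mers occur.
counts = sparse_motif_counts({"g1": "ACGTACGT"}, L=4)
```

Counts for fuzzy motifs can then be assembled by summing the stored counts of the exact motifs they cover, instead of rescanning the sequences.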






DNA origin of replication signals 